Skip to content

[Stack 15/27] Fix D9: z-score thresholds from two-tailed to one-tailed#2446

Closed
jucor wants to merge 7 commits into
jc/fix-test-db-connectionfrom
jc/clj-parity-d9-fix
Closed

[Stack 15/27] Fix D9: z-score thresholds from two-tailed to one-tailed#2446
jucor wants to merge 7 commits into
jc/fix-test-db-connectionfrom
jc/clj-parity-d9-fix

Conversation

@jucor

@jucor jucor commented Mar 13, 2026

Copy link
Copy Markdown
Collaborator

Summary

Stacked on #2443 (Fix test DB connection: use DATABASE_URL with dotenv). Please review and merge #2443 first.
Next in stack: #2448 (Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount))

  • Fix D9: change z-score significance thresholds from two-tailed to one-tailed, matching Clojure's stats.clj
  • Z_90: 1.645 → 1.2816, Z_95: 1.96 → 1.6449
  • Also resolves an internal inconsistency — Python's own stats.py already used the correct one-tailed values

Why one-tailed?

The proportion tests in Polis check whether a comment's agree (or disagree) rate is significantly above 0.5 — a directional hypothesis. One-tailed is correct because we only care about one direction at a time. The two-tailed values were 28% more conservative, causing fewer comments to pass significance.

Test plan

  • TDD: removed xfail from 3 D9 tests, confirmed red (3 failures), applied fix, confirmed green
  • Discrepancy tests: 63 passed, 6 skipped, 50 xfailed (all 7 datasets including private)
  • Regression tests: 19 passed (all 7 datasets, golden snapshots re-recorded)
  • Repness unit tests: 36 passed (boundary values updated to match new thresholds)
  • 4 pre-existing failures unrelated to D9 (PCA incremental blobs, DB-dependent tests)

🤖 Generated with Claude Code

@jucor jucor changed the base branch from jc/vectorize-participant-info to jc/fix-test-db-connection March 13, 2026 14:12
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 0f32d4c to abfeacc Compare March 13, 2026 14:13
@jucor jucor changed the title Fix D9: z-score thresholds from two-tailed to one-tailed [Stack 13/13] Fix D9: z-score thresholds from two-tailed to one-tailed Mar 13, 2026
@jucor jucor requested a review from Copilot March 13, 2026 14:14
@jucor jucor force-pushed the jc/fix-test-db-connection branch from e2f39bb to 89297cf Compare March 13, 2026 14:17

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates Delphi’s repness significance thresholds to use one-tailed z-score cutoffs (aligning with the Clojure implementation and existing stats.py expectations), and refreshes affected tests and golden snapshots.

Changes:

  • Update Z_90/Z_95 constants in repness.py to one-tailed thresholds (1.2816 / 1.6449).
  • Un-xfail and adjust unit/discrepancy tests to assert the new thresholds (including boundary assertions).
  • Re-record regression golden snapshot data reflecting the new repness behavior.

Reviewed changes

Copilot reviewed 19 out of 21 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
delphi/polismath/pca_kmeans_rep/repness.py Switch Z-score thresholds to one-tailed constants and update inline documentation.
delphi/tests/test_repness_unit.py Update z-score significance unit tests for the new thresholds.
delphi/tests/test_old_format_repness.py Update backwards-compat repness tests for the new thresholds.
delphi/tests/test_discrepancy_fixes.py Remove D9 xfail markers now that thresholds match expected values.
delphi/tests/simplified_repness_test.py Update the script constant to the new Z_90 value.
delphi/real_data/r6vbnhffkxbd7ifmfbdrd-vw/golden_snapshot.json Update golden snapshot outputs after repness threshold change.
delphi/docs/PLAN_DISCREPANCY_FIXES.md Add task parallelization notes related to discrepancy fix sequencing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread delphi/polismath/pca_kmeans_rep/repness.py Outdated
Comment thread delphi/docs/PLAN_DISCREPANCY_FIXES.md Outdated
@jucor jucor force-pushed the jc/fix-test-db-connection branch from 89297cf to 403be8d Compare March 13, 2026 16:07
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 9149f7f to e3c5893 Compare March 13, 2026 16:07
@jucor jucor changed the title [Stack 13/13] Fix D9: z-score thresholds from two-tailed to one-tailed [Stack 13/15] Fix D9: z-score thresholds from two-tailed to one-tailed Mar 16, 2026
@jucor jucor force-pushed the jc/fix-test-db-connection branch from 403be8d to 464da16 Compare March 16, 2026 16:04
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from db36889 to 69350d5 Compare March 16, 2026 16:04
@jucor jucor changed the title [Stack 13/15] Fix D9: z-score thresholds from two-tailed to one-tailed [Stack 13/16] Fix D9: z-score thresholds from two-tailed to one-tailed Mar 16, 2026
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 69350d5 to 382de2f Compare March 16, 2026 18:06
@jucor jucor changed the title [Stack 13/16] Fix D9: z-score thresholds from two-tailed to one-tailed [Stack 13/17] Fix D9: z-score thresholds from two-tailed to one-tailed Mar 16, 2026
@jucor jucor changed the title [Stack 13/17] Fix D9: z-score thresholds from two-tailed to one-tailed [Stack 13/24] Fix D9: z-score thresholds from two-tailed to one-tailed Mar 17, 2026
@jucor jucor changed the title [Stack 13/24] Fix D9: z-score thresholds from two-tailed to one-tailed [Stack 13/25] Fix D9: z-score thresholds from two-tailed to one-tailed Mar 17, 2026
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from f8a7007 to 19cce44 Compare March 19, 2026 10:03
@jucor jucor force-pushed the jc/fix-test-db-connection branch from 6b81d8e to 6da6172 Compare March 19, 2026 10:43
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 19cce44 to bf2dd99 Compare March 19, 2026 10:43
@jucor jucor changed the title [Stack 13/25] Fix D9: z-score thresholds from two-tailed to one-tailed [Stack 12/24] Fix D9: z-score thresholds from two-tailed to one-tailed Mar 19, 2026
@jucor jucor force-pushed the jc/fix-test-db-connection branch from 6da6172 to 0ef54ca Compare March 19, 2026 12:31
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from bf2dd99 to f8c5793 Compare March 19, 2026 12:31
@jucor jucor force-pushed the jc/fix-test-db-connection branch from 0ef54ca to b62c0cd Compare March 19, 2026 14:52
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from f8c5793 to 7f733c1 Compare March 19, 2026 14:52
@jucor jucor changed the title [Stack 12/24] Fix D9: z-score thresholds from two-tailed to one-tailed [Stack 13/25] Fix D9: z-score thresholds from two-tailed to one-tailed Mar 19, 2026
@jucor jucor force-pushed the jc/fix-test-db-connection branch from b62c0cd to 83bfe23 Compare March 23, 2026 15:09
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 7f733c1 to c920b61 Compare March 23, 2026 15:11
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 09747ea to 8d94246 Compare March 27, 2026 10:41
@jucor jucor force-pushed the jc/fix-test-db-connection branch from 4ea1333 to 917cba8 Compare March 27, 2026 10:41
@jucor jucor changed the title [Stack 13/25] Fix D9: z-score thresholds from two-tailed to one-tailed [Stack 14/26] Fix D9: z-score thresholds from two-tailed to one-tailed Mar 30, 2026
@jucor jucor force-pushed the jc/fix-test-db-connection branch from 917cba8 to e45a120 Compare March 30, 2026 12:48
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 8d94246 to 9397ddf Compare March 30, 2026 12:48
@jucor jucor changed the title [Stack 14/26] Fix D9: z-score thresholds from two-tailed to one-tailed [Stack 15/27] Fix D9: z-score thresholds from two-tailed to one-tailed Mar 30, 2026
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 9397ddf to e96a1f7 Compare March 30, 2026 12:54
@github-actions

Copy link
Copy Markdown

Delphi Coverage Report

File Stmts Miss Cover
init.py 2 0 100%
benchmarks/bench_pca.py 76 76 0%
benchmarks/bench_repness.py 81 81 0%
benchmarks/bench_update_votes.py 38 38 0%
benchmarks/benchmark_utils.py 34 34 0%
components/init.py 1 0 100%
components/config.py 165 133 19%
conversation/init.py 2 0 100%
conversation/conversation.py 1107 320 71%
conversation/manager.py 131 42 68%
database/init.py 1 0 100%
database/dynamodb.py 387 234 40%
database/postgres.py 305 205 33%
pca_kmeans_rep/init.py 5 0 100%
pca_kmeans_rep/clusters.py 257 22 91%
pca_kmeans_rep/corr.py 98 17 83%
pca_kmeans_rep/pca.py 52 16 69%
pca_kmeans_rep/repness.py 297 43 86%
regression/init.py 4 0 100%
regression/clojure_comparer.py 188 17 91%
regression/comparer.py 887 720 19%
regression/datasets.py 135 27 80%
regression/recorder.py 36 27 25%
regression/utils.py 138 87 37%
run_math_pipeline.py 260 114 56%
umap_narrative/500_generate_embedding_umap_cluster.py 210 109 48%
umap_narrative/501_calculate_comment_extremity.py 112 53 53%
umap_narrative/502_calculate_priorities.py 135 135 0%
umap_narrative/700_datamapplot_for_layer.py 502 502 0%
umap_narrative/701_static_datamapplot_for_layer.py 310 310 0%
umap_narrative/702_consensus_divisive_datamapplot.py 432 432 0%
umap_narrative/801_narrative_report_batch.py 785 785 0%
umap_narrative/802_process_batch_results.py 265 265 0%
umap_narrative/803_check_batch_status.py 175 175 0%
umap_narrative/llm_factory_constructor/init.py 2 2 0%
umap_narrative/llm_factory_constructor/model_provider.py 157 157 0%
umap_narrative/polismath_commentgraph/init.py 1 0 100%
umap_narrative/polismath_commentgraph/cli.py 270 270 0%
umap_narrative/polismath_commentgraph/core/init.py 3 3 0%
umap_narrative/polismath_commentgraph/core/clustering.py 108 108 0%
umap_narrative/polismath_commentgraph/core/embedding.py 104 104 0%
umap_narrative/polismath_commentgraph/lambda_handler.py 219 219 0%
umap_narrative/polismath_commentgraph/schemas/init.py 2 0 100%
umap_narrative/polismath_commentgraph/schemas/dynamo_models.py 160 9 94%
umap_narrative/polismath_commentgraph/tests/conftest.py 17 17 0%
umap_narrative/polismath_commentgraph/tests/test_clustering.py 74 74 0%
umap_narrative/polismath_commentgraph/tests/test_embedding.py 55 55 0%
umap_narrative/polismath_commentgraph/tests/test_storage.py 87 87 0%
umap_narrative/polismath_commentgraph/utils/init.py 3 0 100%
umap_narrative/polismath_commentgraph/utils/converter.py 283 237 16%
umap_narrative/polismath_commentgraph/utils/group_data.py 354 336 5%
umap_narrative/polismath_commentgraph/utils/storage.py 584 518 11%
umap_narrative/reset_conversation.py 159 50 69%
umap_narrative/run_pipeline.py 453 312 31%
utils/general.py 62 41 34%
Total 10770 7618 29%

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 44 to 48
Returns:
True if significant at 90% confidence
"""
return abs(z) >= Z_90
return z > Z_90

Copilot AI Mar 30, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The significance helpers were switched to one-tailed semantics (z > Z_90), but Z_90 is still set to 1.645 (two-tailed 90%). To match Clojure’s z-sig-90? (> 1.2816), update the constant (and its comment) to 1.2816; otherwise the new one-tailed check remains overly conservative and D9 parity tests can’t pass once enabled.

Copilot uses AI. Check for mistakes.
Comment on lines 57 to 61
Returns:
True if significant at 95% confidence
"""
return abs(z) >= Z_95
return z > Z_95

Copilot AI Mar 30, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as Z_90: z_score_sig_95 now uses one-tailed strict z > Z_95, but Z_95 is still 1.96 (two-tailed 95%). Clojure’s z-sig-95? uses 1.6449; update Z_95 accordingly so 95% gating matches the reference implementation.

Copilot uses AI. Check for mistakes.
Comment on lines 555 to 565
@@ -562,6 +564,20 @@ def test_z95_matches_clojure(self):
check.almost_equal(Z_95, 1.6449, abs=0.001,
msg=f"Z_95 should be 1.6449 (one-tailed), got {Z_95}")

Copilot AI Mar 30, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The D9 threshold value assertions are still marked @pytest.mark.xfail, which means this PR’s stated purpose (changing Z_90/Z_95 values) isn’t actually enforced by CI. Once the constants are updated, these xfails should be removed so regressions in the threshold values will fail the suite.

Copilot uses AI. Check for mistakes.
z_score = two_prop_test(70, 100, 50, 100) # 70/100 vs 50/100
print(f"Comparison Z-score: {z_score}")
```
Statistical helpers (`z_sig_90`, `z_sig_95`, `prop_test`, `two_prop_test`) are

Copilot AI Mar 30, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc references z_sig_90/z_sig_95, but repness.py defines z_score_sig_90/z_score_sig_95 (and stats.py has been removed). Update the names here (or add documented aliases) so the docs reflect the actual public API.

Suggested change
Statistical helpers (`z_sig_90`, `z_sig_95`, `prop_test`, `two_prop_test`) are
Statistical helpers (`z_score_sig_90`, `z_score_sig_95`, `prop_test`, `two_prop_test`) are

Copilot uses AI. Check for mistakes.
@jucor jucor force-pushed the jc/fix-test-db-connection branch from b68bd5b to 583f955 Compare March 30, 2026 16:49
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from e96a1f7 to 574c169 Compare March 30, 2026 16:49
jucor and others added 7 commits March 30, 2026 18:04
Map dependency graph and file boundaries for D5-D12. Two parallel
tracks possible: repness formulas (D5→D6→D7→D8→D10→D11) and
conversation/PCA (D3→D15→D12), with D1/D1b after both.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ead stats.py

repness.py: change z_score_sig_90/95 from abs(z) >= threshold to z > threshold,
matching Clojure's (> z-val 1.2816). Also fix inline significance filters in
select_rep_comments_df to use > without .abs().

Remove stats.py and test_stats.py — unused dead code from the original AI-generated
Clojure port. repness.py defines its own z-sig functions and never imported stats.py.

Update usage_examples.md to point to repness.py instead of deleted stats.py.

Add D9 blob comparison tests (significance sets and z-values, xfail until D5/D6),
D5/D6/D7/D8 blob comparison tests with shared_count guards, tid type fixes,
and D9 unit tests for strict > and one-tailed semantics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Require side-by-side Clojure/Python verification for every formula change
- Exhaustive RED phase: boundary conditions, edge cases, missing test audit
- Double-check array shapes, indices, aggregation axes after implementation
- Allow skipping very large private datasets during iteration, only run as final validation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR 14 (vectorized code refactor) is now a prerequisite for all formula
fix PRs, not a post-parity cleanup. It branches off jc/clj-parity-d9-fix
(Stack 13) to make the vectorized production path readable and testable
against Clojure blob values. Remaining dead code cleanup split to PR 14b.

Handoff doc at delphi/docs/HANDOFF_PR14_VECTORIZED_REFACTOR.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The D9 z-score threshold change (two-tailed → one-tailed) combined
with upstream fixes cascaded into this branch changed the repness
output enough to invalidate the vw golden snapshot. Biodiversity
re-recorded too for consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 574c169 to b64cae8 Compare March 30, 2026 17:05
@jucor jucor force-pushed the jc/fix-test-db-connection branch from 583f955 to 380b00d Compare March 30, 2026 17:05
This was referenced Mar 30, 2026
@jucor

jucor commented Mar 30, 2026

Copy link
Copy Markdown
Collaborator Author

Superseded by spr-managed PR stack. See the new stack starting at #2508.

@jucor jucor closed this Mar 30, 2026
jucor added a commit that referenced this pull request Jun 11, 2026
* Vectorize participant info computation (3-15x speedup)

## Summary


- Replaces the O(N×G×C) per-participant Python loop in `_compute_participant_info_optimized` with bulk NumPy operations: matrix-wide vote counting (`np.sum` over axis) and per-group Pearson correlation via `P @ g` matrix multiply
- Adds 31 unit tests covering vote counts, group correlations, edge cases (small groups, zero-std, NaN handling, missing members), and golden snapshot regression
- Correlations now return Python `float` instead of `numpy.float64`
- Includes a benchmark script (`scripts/benchmark_participant_info.py`) that runs old vs new on the same data

## Benchmark results

Measured on real datasets (5 runs, median), old per-participant loop vs new vectorized:

| Dataset | Size | Old | New | Speedup |
|---------|------|-----|-----|---------|
| vw | 69p × 125c × 4g | 0.011s | 0.001s | **14.6x** |
| biodiversity | 536p × 314c × 2g | 0.047s | 0.006s | **8.1x** |
| _(larger private datasets)_ | | | | **3–6x** |

Speedup is higher on smaller datasets (loop overhead dominates) and lower on very large ones (matrix materialization dominates). Overall **3–15x** depending on size.

## Test plan

- [x] 31 unit tests pass (pre-vectorization baseline established first, then re-run post)
- [x] Golden snapshot regression passes for biodiversity + vw
- [x] Full regression test suite passes (40/40)
- [x] Benchmark run on all datasets including private (results above)
- [x] Max correlation diff across all datasets: < 2e-15


🤖 Generated with [Claude Code](https://claude.com/claude-code)


## Squashed commits

- Address Copilot review: clarify MAP estimate, centralize PSEUDO_COUNT imports
- Update plan: add stack cross-reference and GitHub PR numbers
- Vectorize _compute_participant_info_optimized for ~100x speedup
- Add benchmark script for participant info vectorization
- Address Copilot review: drop unused cache, validate --runs, tighten type assert
- Remove old participant_stats() in favor of vectorized replacement

commit-id:ea747196

* Fix test DB connection: use DATABASE_URL with dotenv

## Summary


- **Absorbed PR #2434** (DB connect_timeout): adds `connect_timeout=5` to all DB connections so tests fail fast instead of hanging when Postgres is unavailable. Reverts timeout from production code (a 5s timeout could break real deployments), keeping it only in test code.
- Adds `require_dynamodb()` and `require_s3()` helpers to conftest that probe services with short timeouts and `pytest.fail()` immediately — applied to all tests that previously hung when Docker services were down.
- Adds `pytest-timestamper` for per-test timing visibility.
- Fixes `test_postgres_real_data.py` to use `DATABASE_URL` (consistent with all other delphi Python code) via `python-dotenv`.

Refs #2442

## Test plan

- [x] `test_conversation_from_postgres` passes
- [x] `test_pakistan_conversation_batch` passes (9 min, 400K votes)
- [x] Full delphi test suite: 294 passed, no regressions

🤖 Generated with [Claude Code](https://claude.com/claude-code)


## Squashed commits

- Add connect_timeout to all DB connections, fail tests on DB unavailable
- Fail tests fast when services are unavailable instead of hanging
- Fix conftest docstring: require_dynamodb/require_s3, not require_service
- Skip (not fail) test_minio_access when MinIO is unavailable
- Fail test_conversation_from_postgres fast when DynamoDB is unavailable
- Fail test_run_math_pipeline_e2e fast when DynamoDB is unavailable
- Add pytest-timestamper and fix test_501 hang on Docker unavailable
- Defer heavy import in test_umap_narrative_pipeline (defensive)
- Address Copilot review: fix timing_stats_compared None bug, fix return type
- Fix test DB connection: use DATABASE_URL with dotenv
- Fix CI: use find_dotenv() and fall back to DATABASE_* vars
- Restore get_or_compute_conversation fixture and test reordering hook
- Remove dead xdist_group markers from make_dataset_params

commit-id:b9062b50

* Fix D9: z-score thresholds from two-tailed to one-tailed

## Summary


- Fix D9: change z-score significance thresholds from two-tailed to one-tailed, matching Clojure's `stats.clj`
- `Z_90`: 1.645 → 1.2816, `Z_95`: 1.96 → 1.6449
- Also resolves an internal inconsistency — Python's own `stats.py` already used the correct one-tailed values

## Why one-tailed?

The proportion tests in Polis check whether a comment's agree (or disagree) rate is **significantly above 0.5** — a directional hypothesis. One-tailed is correct because we only care about one direction at a time. The two-tailed values were 28% more conservative, causing fewer comments to pass significance.

## Test plan

- [x] TDD: removed xfail from 3 D9 tests, confirmed red (3 failures), applied fix, confirmed green
- [x] Discrepancy tests: 63 passed, 6 skipped, 50 xfailed (all 7 datasets including private)
- [x] Regression tests: 19 passed (all 7 datasets, golden snapshots re-recorded)
- [x] Repness unit tests: 36 passed (boundary values updated to match new thresholds)
- [x] 4 pre-existing failures unrelated to D9 (PCA incremental blobs, DB-dependent tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)


## Squashed commits

- Plan: add task parallelization analysis for remaining fixes
- Fix D9: match Clojure z-sig semantics (strict >, no abs) and remove dead stats.py
- Re-record vw golden snapshot after D9 z-sig semantics change
- Update plan: mark D9 as done, note stats.py removal for next PR
- Add mathematical rigor and exhaustive testing guidance to fix plan
- Plan: move PR 14 earlier (prerequisite for blob tests) + add handoff doc
- Re-record golden snapshots after upstream cascade

commit-id:0194003d

* Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount)

## Summary


Replace Python's standard one-proportion z-test `prop_test(p, n, p0)` with
Clojure's Wilson-score-like formula `prop_test(succ, n)` from `stats.clj:10-15`:

```
2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)
```

The Clojure formula has a built-in +1 pseudocount (Laplace smoothing / Beta(1,1)
prior) that regularizes extreme values for small Polis groups. This is separate
from the `PSEUDO_COUNT=2.0` used for `pa`/`pd` estimation (Beta(2,2) prior):

- `pa = (na + 1) / (ns + 2)` — Beta(2,2) prior for probability estimation
- `pat = 2 * sqrt(ns+1) * ((na+1)/(ns+1) - 0.5)` — Beta(1,1) prior for significance testing

**What changed in the output**: `pat`, `pdt` values (proportion test z-scores),
and downstream `agree_metric` / `disagree_metric` values. The z-scores are
now slightly different due to `sqrt(n+1)` vs `sqrt(n)` and `(succ+1)/(n+1)` vs
`(na+1)/(n+2)` denominators.

## Changes
- `repness.py`: `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula
- `repness.py`: `prop_test_vectorized(p, n, p0)` → `prop_test_vectorized(succ, n)`
- `repness.py`: Callers updated to pass raw counts `(na, ns)` instead of `(pa, ns, 0.5)`
- `test_discrepancy_fixes.py`: Removed xfail from D5 formula test, added 8 test cases + edge case
- `test_repness_unit.py`, `test_old_format_repness.py`: Updated for new signature
- Golden snapshots re-recorded for all datasets

## Test plan
- [x] D5 formula tests pass (8 input pairs + edge cases)
- [x] D5 Clojure blob consistency check passes (all datasets)
- [x] Full test suite passes (public + private, 19/19 regression tests)
- [x] Only pre-existing failure: pakistan-incremental D2 (unrelated)

🤖 Generated with [Claude Code](https://claude.com/claude-code)


## Squashed commits

- RED: add D5 blob injection test (prop_test vs Clojure p-test values)
- Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount)
- Update plan and journal: mark D5 as done
- Plan: add D5 PR number and stack position to cross-reference

commit-id:48b77ba3

* Fix D6: match Clojure two-proportion test formula (+1 pseudocount)

## Summary


The Python `two_prop_test` used a standard two-proportion z-test with no pseudocounts,
while Clojure's `stats/two-prop-test` (stats.clj:18-33) adds +1 to all four inputs
(`succ-in`, `succ-out`, `pop-in`, `pop-out`) via `(map inc ...)` before computing
the pooled z-test. This Laplace smoothing regularizes z-scores for small group sizes,
which are common in Polis conversations.

## Changes
- **Signature change**: `two_prop_test(p1, n1, p2, n2)` (proportions) →
  `two_prop_test(succ_in, succ_out, pop_in, pop_out)` (raw counts)
- **Formula**: Standard pooled z-test on pseudocount-adjusted values:
  `pi1 = (succ_in+1)/(pop_in+1)`, `pi_hat = (s1+s2)/(p1+p2)`
- **Callers updated**: Both scalar (`add_comparative_stats`) and vectorized
  (`compute_group_comment_stats_df`) now pass raw counts matching Clojure's
  `(stats/two-prop-test (:na in-stats) (sum :na rest-stats) (:ns in-stats) (sum :ns rest-stats))`
  (repness.clj:97-100)

## Affected output fields
- `rat` (agree representativeness test z-score)
- `rdt` (disagree representativeness test z-score)
- `agree_metric`, `disagree_metric` (downstream of rat/rdt)

## Test plan
- [x] Targeted D6 tests pass (formula, edge cases, regularization effect)
- [x] Full test suite passes (excluding DynamoDB/MinIO tests)
- [x] Private dataset tests pass (--include-local)
- [x] Golden snapshots re-recorded for all 7 datasets

🤖 Generated with [Claude Code](https://claude.com/claude-code)


## Squashed commits

- RED: add D6 blob injection test (two_prop_test vs Clojure repness-test)
- Fix D6: match Clojure two-proportion test formula (+1 pseudocount)
- Plan: add D6 PR number and stack position to cross-reference

commit-id:23c03d70

* Fix D7: match Clojure repness metric formula (product of 4 signed values)

## Summary


Changes the representativeness metric from a weighted sum of absolutes to the
Clojure product formula (repness.clj:188-190).

**Before (Python):**
- agree_metric = `pa * (|pat| + |rat|)` — weighted sum, tolerant of weak factors
- disagree_metric = `(1 - pd) * (|pdt| + |rdt|)` — doubly wrong: uses `(1-pd)` and sum

**After (Clojure formula):**
- agree_metric = `ra * rat * pa * pat` — product of 4 signed values
- disagree_metric = `rd * rdt * pd * pdt` — product of 4 signed values

The product formula is more conservative: any factor near zero kills the entire
metric, requiring ALL dimensions (probability, significance, relative
representativeness) to be strong simultaneously.

The old disagree formula was doubly wrong:
1. Used `(1 - pd)` instead of `pd` — high metric when disagree probability is LOW
2. Used a weighted sum of absolutes instead of a signed product

No feature flag for the old formula — it has no defensible behavior.

## Test plan
- [x] 5 new D7 formula tests (agree product, disagree product, zero-kills-metric,
      sign preservation, multiple known values)
- [x] Updated unit tests in test_repness_unit.py and test_old_format_repness.py
- [x] Full test suite passes (excluding DynamoDB/MinIO tests)
- [x] Private dataset tests pass (--include-local)
- [x] Golden snapshots re-recorded for all 7 datasets

🤖 Generated with [Claude Code](https://claude.com/claude-code)


## Squashed commits

- Add PR description and update plan/journal for D7 fix

commit-id:80eaa87c

* Fix D8: match Clojure repful classification (rat > rdt)

## Summary


Simplifies the repful ("representative for agree or disagree?") classification
to match Clojure's `finalize-cmt-stats` (repness.clj:175-177).

**Before (Python):** 3-branch conditional:
1. `pa > 0.5 AND ra > 1.0` → agree
2. `pd > 0.5 AND rd > 1.0` → disagree
3. Fallback: whichever metric is higher

**After (Clojure):** `rat > rdt` → agree, else disagree.

The old thresholds were redundant — `rat` and `rdt` (two-proportion z-scores)
already encode whether the group's agree/disagree rate is significantly higher
than other groups. The simple comparison is both correct and clearer.

## Changes
- `repness.py`: `finalize_cmt_stats()` — 3-branch logic → `rat > rdt`
- `repness.py`: Vectorized — `np.select` with conditions → `np.where(rat > rdt)`
- `test_discrepancy_fixes.py`: Expanded from 2 to 6 tests (including edge cases:
  equal rat/rdt, both negative, both zero)
- Golden snapshots re-recorded (repful direction changes for some comments)

## Test plan
- [x] 6 targeted D8 tests pass (rat>rdt, rat<rdt, equal, both negative, both zero, old-vs-new divergence case)
- [x] Full test suite passes (excluding DynamoDB/MinIO tests)
- [x] Private dataset tests pass (--include-local)
- [x] Golden snapshots re-recorded for all 7 datasets
- [x] 19/19 regression tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)


## Squashed commits

- Journal: add review session entry (2026-03-17)

commit-id:9ec28252

* Fix D15: match Clojure moderation handling (zero out columns, don't remove)

## Summary


Python's `_apply_moderation()` removed moderated-out comment columns entirely
from `rating_mat`. Clojure's `zero-out-columns` (named_matrix.clj:214-230) sets
all values in moderated columns to 0, preserving matrix structure.

This fix changes Python to match:
- Moderated-out comment columns are **zeroed** (values set to 0.0), not removed
- `rating_mat` retains the same column count as `raw_rating_mat`
- Moderated-out participants (rows) are still removed — unchanged

### Why zeroing matters

- **Matrix dimensions**: Clojure's `rating-mat` has the same shape as `raw-rating-mat`.
  Downstream code (PCA, repness) processes the same-shaped matrix.
- **tids list**: Column indices stay stable. Consumers depend on this.
- **Practical impact**: Zeroed columns have no signal (na=0, nd=0), so they fail
  significance tests and are excluded from repness/consensus. PCA sees zero variance.

## Changes
- `conversation.py`: `_apply_moderation()` — zero out columns instead of removing
- `test_discrepancy_fixes.py`: 5 new synthetic tests + 2 enhanced real-data tests
- `test_conversation.py`: Updated to expect zeroed columns

## Test plan
- [x] Synthetic tests: zeroing preserves columns, values are 0, non-moderated unchanged
- [x] Real-data test: biodiversity-incremental (169 mod-out comments)
- [x] Full public test suite: 328 passed, 0 failed
- [x] TDD cycle: RED (2 failures) → GREEN (all pass)

🤖 Generated with [Claude Code](https://claude.com/claude-code)


## Squashed commits

- Fix D15: match Clojure moderation handling (zero out columns, don't remove)

commit-id:c3450b9a

* Fix K-means k divergence: preserve vote-encounter row order

## Summary


- Fix K-means k divergence between Python and Clojure by preserving vote-encounter order for participant rows in the rating matrix
- Python was using `natsorted()` (PID-numeric order) while Clojure's NamedMatrix preserves insertion order — different row ordering cascades into different first-k-distinct initialization seeds for group-level k-means
- On vw: Python picked k=4 (wrong), Clojure picks k=2 — now both pick k=2 with identical cluster memberships

## Investigation findings

The divergence chain: rating_mat row order → PCA projection order → base-cluster ID assignment → group k-means first-k-distinct init → different local optima → different silhouette landscape → different k.

PCA components are identical (cosine similarity = 1.0), silhouette implementation matches, k-means algorithm matches — only the data ORDER feeding first-k-distinct differed.

## Changes

- `conversation.py`: `update_votes()` preserves vote-encounter order for participant rows instead of `natsorted()`
- `conversation.py`: `_apply_moderation()` preserves row order with list comprehension
- Column (comment ID) ordering remains `natsorted` — doesn't affect clustering
- Re-recorded vw cold-start blob and golden snapshots
- Updated ordering tests, removed `test_group_clustering` xfail
- Added `scripts/investigate_k_divergence.py` diagnostic tool

## Cold-start blob results

| Dataset | Clj k | Py k | Match |
|---------|-------|------|-------|
| vw | 2 | 2 | exact (sizes [50,17]) |
| biodiversity | 2 | 2 | exact (sizes [81,19]) |
| bg2018 | 2 | 2 | close ([51,49] vs [52,48]) |
| FLI | 2 | 3 | inherent PCA divergence (94.5% NaN, sil gap 0.001) |

## Test plan

- [x] All 297 tests pass (0 failures, 58 xfailed)
- [x] vw cold-start: k=2 exact match with Clojure blob
- [x] biodiversity cold-start: k=2 exact match
- [x] Ordering tests updated to expect encounter order
- [ ] Re-record private dataset golden snapshots after stack rebase

🤖 Generated with [Claude Code](https://claude.com/claude-code)


## Squashed commits

- Fix K-means k divergence: preserve vote-encounter order for participant rows
- Update plan and journal: K-divergence investigation resolved
- Remove investigation script (one-off diagnostic, not production code)
- Rename k-divergence doc: investigation record, not a handoff
- Update references to renamed investigation doc

commit-id:4598a0a1

* Docs: plan + journal updates for the 2026-06-09/10 stack session

Plan and journal updates accumulated over the 2026-06-09 / 2026-06-10
sessions where the stack (#2516#2524) was audited, recovered, and
re-reviewed by Copilot. No code changes — just docs that summarize
what landed in the math-fix PRs below this one and what's pending for
the next stack.

Major sections added/updated:

CLJ-PARITY-FIXES-JOURNAL.md:
- "Session: Audit + Recovery (2026-06-09)" — log of the 9-PR
  multi-agent audit (20 confirmed / 26 refuted findings), the
  structural discovery that PR #2521 D7 and PR #2522 D8 had landed as
  docs-only commits with zero code changes, the recovery work, and the
  in-session-discovered follow-ups (D5 n=0 short-circuit, D15
  downstream regressions, to_dynamo_dict parallel inline code).
- "Copilot review follow-up (2026-06-10)" — outcomes of two rounds of
  Copilot review across the stack (13 substantive comments + 7 stale
  historical ones), including the moderated-out-participants
  (mod_out_ptpts) leak Copilot caught in PR #2523 and the
  serialization-shape regression test added to pin
  to_dict / to_dynamo_dict outputs against silent drift.
- "Operational impact of the mod_out_ptpts fix (prodclone
  verification, 2026-06-10)" — aggregate evidence that 67
  conversations in prodclone have at least one banned participant,
  some still actively voting in 2025; the Python-side fix isn't strict
  Clojure parity (Clojure has the same latent leak) but is the
  correct Polis behavior.

PLAN_DISCREPANCY_FIXES.md:
- All D-row entries updated to current status (D5, D6, D7, D8, D9, D15
  now DONE; D7 and D8 specifically recovered from the docs-only state
  the previous PRs left them in).
- Stale PR-number cross-references corrected
  (#2446 -> #2518 for D9, #2453 -> #2524 for K-divergence).
- "Audit-Discovered Issues (2026-06-09)" section — final state of all
  four cleanup PRs (#2516, #2517, #2523 follow-up, #2524 cleanups).
- "Pending — needs team discussion (2026-06-10)" section — the
  participant-mod / banned-participant semantics gap (waiting on
  Colin), golden-snapshot re-recording, scalar/vectorized pi_hat > 1
  edge case, to_dynamo_dict (now resolved).

Stack-tip docs PR to ride on top of the math fixes when the stack
merges.

commit-id:0ccf8c0a
@jucor jucor deleted the jc/clj-parity-d9-fix branch June 11, 2026 08:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants